WORLD INTELLECTUAL PROPERTY ORGANIZATION 
International Bureau




PCT

INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT)



(51) International Patent Classification:
C12Q 1/68, C12N 15/10

A1



(11) International Publication Number: WO 00/53806 

(43) International Publication Date: 14 September 2000 (14.09.00) 



(21) International Application Number: PCT/IB99/00502

(22) International Filing Date: 10 March 1999 (10.03.99) 



(30) Priority Data: 
09/263 J 96 



5 March 1999 (05.03.99)    US



(71) Applicant: CHUGAI PHARMACEUTICAL CO. LTD. [JP/JP]; 1-9, Kyobashi
2-chome, Chuo-ku, Tokyo 104-8301 (JP).

(72) Inventor: SPINELLA, Dominic, G.; 7026 Via Calafia, La Costa,
CA 92009 (US).



(81) Designated States: AU, CA, JP, KR, European patent (AT,
BE, CH, CY, DE, DK, ES, FI, FR, GB, GR, IE, IT, LU,
MC, NL, PT, SE).



Published 

With international search report. 



(54) Title: METHOD OF IDENTIFYING GENE TRANSCRIPTION PATTERNS

(57) Abstract

This invention provides a rapid, artifact-free, improved method of obtaining short DNA "tags", or arrays thereof, allowing for
determination of the relative abundance of a gene transcript within a given mRNA population. The method is useful to identify patterns of gene
transcription, as well as to identify new genes.






FOR THE PURPOSES OF INFORMATION ONLY
Codes used to identify States party to the PCT on the front pages of pamphlets publishing international applications under the PCT. 



AL  Albania
AM  Armenia
AT  Austria
AU  Australia
AZ  Azerbaijan
BA  Bosnia and Herzegovina
BB  Barbados
BE  Belgium
BF  Burkina Faso
BG  Bulgaria
BJ  Benin
BR  Brazil
BY  Belarus
CA  Canada
CF  Central African Republic
CG  Congo
CH  Switzerland
CI  Côte d'Ivoire
CM  Cameroon
CN  China
CU  Cuba
CZ  Czech Republic
DE  Germany
DK  Denmark
EE  Estonia
ES  Spain
FI  Finland
FR  France
GA  Gabon
GB  United Kingdom
GE  Georgia
GH  Ghana
GN  Guinea
GR  Greece
HU  Hungary
IE  Ireland
IL  Israel
IS  Iceland
IT  Italy
JP  Japan
KE  Kenya
KG  Kyrgyzstan
KP  Democratic People's Republic of Korea
KR  Republic of Korea
KZ  Kazakstan
LC  Saint Lucia
LI  Liechtenstein
LK  Sri Lanka
LR  Liberia
LS  Lesotho
LT  Lithuania
LU  Luxembourg
LV  Latvia
MC  Monaco
MD  Republic of Moldova
MG  Madagascar
MK  The former Yugoslav Republic of Macedonia
ML  Mali
MN  Mongolia
MR  Mauritania
MW  Malawi
MX  Mexico
NE  Niger
NL  Netherlands
NO  Norway
NZ  New Zealand
PL  Poland
PT  Portugal
RO  Romania
RU  Russian Federation
SD  Sudan
SE  Sweden
SG  Singapore
SI  Slovenia
SK  Slovakia
SN  Senegal
SZ  Swaziland
TD  Chad
TG  Togo
TJ  Tajikistan
TM  Turkmenistan
TR  Turkey
TT  Trinidad and Tobago
UA  Ukraine
UG  Uganda
US  United States of America
UZ  Uzbekistan
VN  Viet Nam
YU  Yugoslavia
ZW  Zimbabwe


WO 00/53806    PCT/IB99/00502


DESCRIPTION 

METHOD OF IDENTIFYING GENE TRANSCRIPTION PATTERNS 

Cross reference to related applications

This application is a continuation in part of pending U.S. application 08/784,208,
filed January 15, 1997.

Field of the Invention 

This invention relates to a method of identifying gene transcription patterns
in a cell or tissue.
Background of the Invention 

Expressed Sequence Tag (EST) programs have provided DNA sequence
information for a substantial proportion of the expressed genes in the human
genome (Fields, C. et al., Nature Genetics 7: 345-346 (1994)). However, DNA
sequence information alone is insufficient for a complete understanding of gene
function and regulation.

Because only a fraction of the full genetic repertoire is expressed in a cell
at any given time, and because gene expression affects cell phenotype, tools to
qualitatively and quantitatively monitor gene transcription are needed.

Classical qualitative and quantitative techniques such as northern blotting
and nuclease protection assays are accurate and quantitative, but cannot provide
information quickly enough to generate global gene expression profiles.

More recent approaches include sequence analysis of random isolates from
cDNA libraries, Polymerase Chain Reaction (PCR) and hybridization-array-based
methodologies, but each of these methods has limitations.

High-density microarray hybridization of RNA or cDNA corresponding to
known genes (Ramsay, G., Nature Biotechnology 16: 40-44 (1998)) is a fast
method for parallel analysis of global gene expression. This method, however, is
limited to known genes, and the number of genes in a single microarray is limited
as well.

Sequencing of random isolates from cDNA libraries to generate ESTs
provides quantitative results, but is a daunting task (Adams, M.D., et al., Science
252: 1651-1656 (1991); Adams, M.D. et al., Nature 355: 632-634 (1992)). Within
cDNA libraries, the frequency of a cDNA clone should be proportional to the
steady-state amount of that transcript in the RNA population of the cell or tissue
from which the RNA was derived (Okubo, K. et al., Nature Genetics 2: 173-179
(1992); Lee, N.H. et al., Proc Natl Acad Sci USA 92: 8303-8307 (1995)). This
approach, however, requires DNA sequencing efforts beyond the capacity of most
laboratories.

PCR-based methods can generate DNA fragments from mRNA pools which
differ in size and sequence, enabling their separation and identification to form an
expression profile. Profiles from different cell or tissue populations can be compared
to detect differentially expressed genes. This method has been used to establish databases
of mRNA fragments (Williams, J.G.K., Nucl. Acids Res. 18:6531 (1990); Welsh,
J., et al., Nucl. Acids Res. 18:7213 (1990); Woodward, S.R., Mamm. Genome
3:73 (1992); Nadeau, J.H., Mamm. Genome 3:55 (1992)). Some have sought to
adapt these methods to compare mRNA populations between two or more
samples (Liang, P. et al., Science 257:967 (1992); see also Welsh, J. et al., Nucl.
Acids Res. 20:4965 (1992); Liang, P., et al., Nucl. Acids Res. 3269 (1993); and WO
95/13369, published May 18, 1995). Differential Display (Liang, P. and Pardee, A.B.,
Science 257: 967-971 (1992)) and Amplified Fragment Length Polymorphism (AFLP)
(Vos, P. et al., Nucleic Acids Res. 23: 4407-4414 (1995)), for example,
can provide gene expression information at the appropriate speed and scale, but
these methods can suffer from a lack of precision and reproducibility due to their
susceptibility to quantitative PCR artifacts.

Recently, a variation of PCR for a random cDNA sequencing approach was
described by Velculescu et al. (Velculescu, V. E. et al., Science 270:484 (1995)).
This technique, called Serial Analysis of Gene Expression (SAGE), generates
short, defined sequences from cDNAs which are randomly ligated in a tail-to-tail
fashion and amplified by PCR to form "di-tags". These di-tags are then
concatenated into arrays which are cloned and analyzed by DNA sequencing.
Because each sequencing template contains identifiable tags corresponding to
many genes, the potential throughput of SAGE exceeds traditional cDNA
sequencing, allowing gene transcription profiling in many laboratories.

However, the results of SAGE, like those of any other PCR process, are influenced by
factors other than starting template abundance. Sequence-specific differences in
"amplification efficiency" are known to give rise to artifactual differences in product
yield. That is, the quantity of PCR product may differ in the absence of real
differences in starting template. For example, amplification of the same template
preparation produces product yields that can vary by as much as 6-fold (Gilliand
et al., PCR Protocols, Academic Press, pp 60-69 (1990)). Hence, any PCR-based
method that attempts to infer starting template abundance from the quantity of
product generated by amplification requires stringent co-amplification controls.

Thus, there is a need for a simple and reproducible method for detecting
and quantifying gene transcription, identifying genes, and gene transcription
patterns and frequency in individual cells or tissues, which is free from PCR and
other artifacts, provides for unknown genes, and yet is fast enough to allow speedy
detection and comparison between samples.

In order to circumvent the problems found in the art, we have developed a
cDNA tag-based technique called TALEST (Tandem Arrayed Ligation of
Expressed Sequence Tags) that avoids PCR amplification artifacts. The technique
provides a ~25-fold increase in throughput relative to random cDNA sequencing
approaches to gene expression profiling.

Summary of the Invention

This invention provides an improved method of obtaining short DNA "tag" 
sequences which allows for determination of the relative abundance of a gene 
transcript within a given mRNA population. 

This invention provides a method of obtaining an array of tags.

This invention provides a method of identifying patterns of gene
transcription.

This invention provides a method of detecting differences in gene 
transcription between two or more mRNA populations. 

This invention provides a method of determining the frequency of individual 
gene transcription in an mRNA population. 

This invention provides a method of screening for the effects of a drug on
a cell or tissue.

This invention provides a method of detecting the presence of a stress,
whether disorder, disease, the onset or proceeding of development or
differentiation, exogenous substance (chemical, cofactor, biomolecule or drug),
condition (including environmental conditions, such as heat, osmotic pressure, or
the like), receptor activity (whether due to a ligand in a receptor or otherwise), or
aberrant cellular condition (including mutation, unusual copy number or the like),
in a target organism.

This invention provides a method of isolating a gene. 

This invention provides a kit for obtaining a tag or an array of tags. 
Detailed Description of the Invention 

To aid the skilled artisan in understanding this invention, the following
definitions are provided, where they deviate from the terms commonly used in the
art.

"A pattern of gene transcription" as used herein, means the set of genes 
within a specific tissue or cell type that are transcribed or expressed to fonn RNA 
molecules. Which genes are expressed in a specific cell line or tissue and at what 
level the genes are expressed will depend on factors such as tissue or cell type, 
stage of development of the cell, tissue, or target organism and whether the cells 
are normal or transformed cells, such as cancerous cells. For example, a gene is 
expressed at the embryonic or fetal stage in the development of a specific target 
organism and then becomes non-expressed as the target organism matures. Or, 
as another example, a gene is expressed in liver tissue but not in brain tissue of 
an adult human, in another example, a gene is expressed at low levels in normal 
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^ lung tissue but is expressed at higher levels in diseased lung tissue. 

"A punctuating restriction endonuclease" as used herein, means a 
restriction endonuclease having a probability of recognizing a sequence within 
each copy of cDNA. Preferably, the punctuating endonuclease recognizes a 
5 sequence consisting of less than six bases. More preferably the punctuating 
endonuclease recognizes a sequence consisting olfour bases. Most preferably, 
the punctuating endonuclease is Mspl, Haelll or SauSal, or any isoschizomer 
thereof. 

"A Type lis restriction endonuclease" as used herein, means a restriction 

10 endonuclease allowing DNA cleavage at a site in the DNA distant from the 
recognition sequence for the restriction endonuclease. Preferably, the Type IIS 
restriction endonuclease recognizes four to seven bases and cleaves the adjacent 
DNA 10-18 bases 3' to the recognition sequence, also preferably greater than 10 
bases. Hence "distant" means within the range of type lis or type lls-Iike restriction 

15 endonucleases. Most preferably, the type Ms restriction endonuclease is Bsgl, 
BseRI, Fokl. BsmR. 

"A 5' cloning restriction endonuclease" as used herein, means a restriction 
endonuclease having a corresponding methylase or other protection means that 
can protect any DNA from cleavage by the enzyme. Preferably, the 5' cloning 

20 restriction endonuclease recognizes a sequence of between about four to ten 
bases. Examples of this most preferably, the 5' cloning restriction endonuclease 
isEcoRI,BamH1,Hindlll 

"A 3' cloning restriction endonuclease" as used herein, means a restriction 
endonuclease having a recognition sequence that appears infrequentiy in the 

25 human genome. Preferably, the 3* cloning restriction endonuclease recognizes a 
sequence consisting of six or more bases, preferably more than six bases. Mpre 
preferably the 3* cloning restriction endonuclease recognizes a sequence 
consisting of eight or more bases containing a CG dinucleotide within it. More 
preferably, the 3' cloning restriction endonuclease provides a cleavage end that 

30 does not easily ligate to the cleavage end generated by the 5' cloning restriction 
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enzyme. An example of a most prefen-ed is the 3' cloning restriction endonuclease, 
Notl. 

"A 3'-most cDNA fragment" as used herein, means a fragment of double- 
stranded cDNA, transcribed from an mRNA population of interest from an oligo dT 
primer, which is preferably biotinylated. consisting of that portion of the full length 
cDNA between the 3'-most punctuating restriction endonuclease site, and the 3' 
tenminus of the cDNA. The 3 -most cDNA fragment may be isolated on a solid 
phase matrix containing streptavidin so as to ease its separate the fragment from 
other cDNA fragments produced by digestion with the punctuating restriction 
endonuclease. 
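
The selection of the 3'-most cDNA fragment can be illustrated in silico. The following is a minimal Python sketch only, not part of the described method. It assumes the sense strand of the cDNA is supplied as a 5'-to-3' string and that the punctuating enzyme is MspI (recognition sequence CCGG, cleaving after the first C on the strand shown). The function name, parameter names and example sequence are hypothetical.

```python
def three_prime_most_fragment(cdna, site="CCGG", cut_offset=1):
    """Return the sense-strand sequence running from the cut in the
    3'-most recognition site to the 3' end of the cDNA.

    cdna       -- sense strand, written 5'->3' (illustrative input)
    site       -- recognition sequence of the punctuating enzyme
                  (assumed here: MspI, CCGG)
    cut_offset -- number of bases of the site left upstream of the cut
                  (assumed here: 1, i.e. C^CGG)
    """
    seq = cdna.upper()
    pos = seq.rfind(site)  # rightmost occurrence = 3'-most site
    if pos == -1:
        raise ValueError("no recognition site found in this cDNA")
    return seq[pos + cut_offset:]


# Hypothetical cDNA with two CCGG sites; only the 3'-most one matters.
example = "ATGCCGGTTTACCGGAGCTAGCTAGGCATAAAAAA"
fragment = three_prime_most_fragment(example)
print(fragment)
```

A cDNA lacking the site yields no fragment at all, which is why the sketch raises an error rather than returning the whole sequence.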

"A first cDNA construct" as used herein, means . a cDNA construct 
comprising the 3-most cDNA fragment ligated to a 5' adapter. A 5' adapter is 
ligated to the 5' end of the cDNA fragment providing the first cDNA constmct. It 
is preferred that this 5' adapter would provide suitable recognition sites for 
endonucleases envisioned to be used therein, and it is also preferable to provide 
sufficient molecular weight for resolution fiiom tags, of course these requirements 
change with choice of enzyme, staining method for resolution, and the like. The 5' 
adapter may be biotinylated allowing the first cDNA construct to be captured on 
a solid phase matrix containing streptavidin so as to ease its isolation. 

"A second cDNA construct" as used herein, means a cDNA constmct 
provided by cleaving the first cDNA construct with a type Us restriction 
endonuclease which recognizes a sequence within the 5' adapter but cuts the DMA 
within the cDNA fragment 

"A third cDNA constaict" as used herein, means a cDNA construct 
comprising the second cDNA constaict and a 3' adapter. A 3* adapter is ligated 
to the 3' end of the second cDNA construct providing the third cDNA constmct. 
Preferably, the third cDNA constmct is biotinylated and may be captured on a solid 
phase matrix containing streptavidin so as to ease its isolation. 

"A fourth cDNA construct" as used herein, means a cDNA constmct 
provided by cleaving the third cDNA constmct with the 5' cloning restriction 
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endonucleases and 3' cloning restriction endonuclease which recognizes sites 
located in the 5' and 3* adapters, respectively. 

"A 5' adapter" as used herein, means an adapter consisting of a double- 
stranded poiydeoxyribonucleotide containing a recognition sequence for a type lis 
5 restriction endonuclease. The 5' adapter is ligated to the 5' end of the cDNA 
fragment(s} generated by cleavage of cDNA with the punctuating restriction 
endonuclease. Preferably, the 5' adapter further contains a single-stranded 
overhang sequence compatible with thef overhang sequence produced by 
cleavage of a cDNA fragment with the punctuating restriction endonuclease. 

1 0 ("Overhang," as used herein, is defined as the effect of having a double stranded 
DNA, or a DNA/RNA strand that, while largely double stranded, has at one or both 
ends one or more unpaired or dangling bases on one or both strand, which would 
be paired, but for the fact that there is no complement on the other strand. 
Preferably, these occur where one strand has an overhang on one end and the 

1 5 other strand has an overhang on the other end of the double stranded DNA.) More 
preferably, the 5' adapter further contains a recognition sequence for a 5' cloning 
restriction endonuclease located 5* to the recognition sequence for the type lis 
restriction endonuclease. In a 5* adapter, the 5' cloning restriction endonuclease 
recognition sequence is located greater than about four, preferably greater than 

20 about 10, more preferably greater than about twenty, most preferably greater than 
about 30, and preferably less than about 90, more preferably less than about 70, 
most preferably less than about 60, nucleotides 5' to the type lis restriction 
endonuclease cleavage sequence. By ligating to a cDNA fragment, a 5' adapter 
re-creates a recognition sequence for the punctuating restriction endonuclease. 

25 Sense strand of the 5' adapter has preferably a sequence shown in SEQ ID N0.1 
and antisense strand of the 5' adapter has preferably a sequence shown in SEQ 
ID NO. 2. 

"A 3' adapter" as used herein, means an adapter consisting of a double- 
stranded poiydeoxyribonucleotide for ligation to the 3' end of a second cDNA 
30 construct providing a third cDNA construct. Preferably, a 3* adapter comprises a 
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degenerate single-stranded end compatible with all possible ends of the second 
cDNA construct produced by a digestion with the type lis restriction endonuclease. 
More preferably, a 3' adapter further contains a recognition sequence for the 
punctuating restriction endonuclease located 3* to the degenerate end. In the 3' 
5 adapter, the recognition sequence of the punctuating restriction endonuclease is 
preferably adjacent to the degenerate single-stranded end compatible with ends 
of the second cDNA constmct produced by the type lis restriction endonuclease. 
Preferably, a 3* adapter further comprises a recognition sequence for a 3' cloning 
restriction endonuclease located 3' to the recognition sequence for the punctuating 
10 restriction endonuclease. Preferably, the 3* adapter contains one or more biotin 
molecules located 3' to the recognition sequence for the 3' cloning restriction 
endonuclease where the 3* restriction endonuclease is not the sense strand of the 
3' adapter comprises all or part of SEQ ID N0.3 preferably is SEQ ID NO. 3 and 
antisense strand of the 3' adapter comprises all or part of SEQ ID NO. 4, and more 
15 preferably Is SEQ ID NO. 4. 

"A tag" as used herein, means a DNA a sequence .consisting of: 
(1) preferably double-stranded 10-14. deoxyribonucleotides 
corresponding to a cDNA sequence located proximal to the 3'-most punctuating 
, restriction endonuclease site in the cDNA. 
20 (2) a double-stranded base-pair flanking both ends of the cDN A-derived 

sequence which is itself derived from the recognition sequence from the 
punctuating restriction endonuclease, and optionally 

(3) single-stranded overhang sequences derived from the punctuating 
restriction endonuclease generating self-compatible cohesive ends. 
25 Before amplification, one tag represents one copy of mRNA and no two 

same tags are created from one copy of mRNA. A tag comprises cDNA 
sequences. Preferably, the tags are ligated together using DNA ligase. Treatment 
of tag ends may be either blunt or cohesive with overhangs. For reasons the 
skilled artisan will appreciate, an overhang is preferred. More preferably the tags, 
30 when ligated together, regenerate the recognition sequence for the punctuating 
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restriction endonuclease allowing the recognition of discrete cDNA-derived 

sequences owing to their separation by the punctuating sequence. 

"A tag sequence" as used herein, means DNA sequence comprising at 

least one tag sequence. An array of tags includes a number of tags, having one 
5 or more sequences. The tag sequence can be included in linear oligonucleotide, 

in a vector or the like. 

. "A cDNA tag library" as used herein, means a cDNA library prepared in a 

vector comprising (1 ) llgated fragment of a 5' adapter cleaved with a 5' cloning 

restriction endonuclease, (2) ligated fragment of 3' adapter cleaved with a 3' 
10 cloning restriction endonuclease, and (3) a tag. A cDNA tag library is cloned into 

a cloning vector and can be amplified in a host cell. 

"An array of tags" as used herein, means tags ligated with their cohesive 

ends or blunt ends as to form double-stranded DNA sequence. The anray of tags 

is also referred to as "a concatemer". which means concatenated tags. The array 
15 of tags can be included in a vector. Preferably, an array of tags comprises cDNA 

sequences interspersed by recognition sequences for a punctuating restriction 

endonuclease. The ligated an-ays of tags comprise approximately at least 10 tags. 

preferably at least 30 tags, more preferably at least 40 tags and less than 70 tags. 

more preferably less than 60 tags, most preferably less than about 51 tags. More 
20 preferably, an an^ay of tags begins with and ends with a recognition sequence for 

the punctuating restriction endonuclease. Most preferably, the recognition 

sequence for the punctuating restriction endonuclease is located between each 

tag. 

"A punctuation sequence" as used herein, means a sequence formed by 
25 ligating two ends digested with a punctuating restriction endonuclease as detected 
in a sequence of an array of tags which punctuates DNA nucleotide sequences. 
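
Because the punctuation sequence separates neighbouring tags, a sequenced array can in principle be parsed by splitting on it. The sketch below is a simplified illustration under stated assumptions, not the applicant's analysis procedure: the punctuating enzyme is taken to be MspI (punctuation sequence CCGG), only one strand is read, tag orientation is ignored, and pieces outside the 10-14 base range are discarded. The function name and example array are hypothetical.

```python
def split_array(array_seq, punctuation="CCGG", min_len=10, max_len=14):
    """Split one strand of a sequenced array of tags on the punctuation
    sequence and keep the pieces whose length falls in the expected
    range for a cDNA-derived tag sequence."""
    pieces = array_seq.upper().split(punctuation)
    return [p for p in pieces if min_len <= len(p) <= max_len]


# Hypothetical two-tag array that begins and ends with the punctuation.
array_seq = "CCGG" + "AAAAACTTTTTG" + "CCGG" + "GATTACAGATTA" + "CCGG"
print(split_array(array_seq))
```

The length filter is a design choice for the sketch: it drops the empty strings produced at the two ends of the array and any vector-derived flanking sequence that happens not to fall in the tag length range.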

"A clamp" as used herein, means a base-pair derived from the recognition
sequence from a punctuating restriction endonuclease which remains attached to
the cDNA-derived sequence when a tag is generated by digestion with the
punctuating restriction endonuclease. The base composition of a clamp preferably
consists of guanine (G) and cytosine (C), and is referred to as "a GC-clamp", which
means a clamp consisting of G and C. The function of a GC-clamp is to enhance
thermal stability of a tag by increasing the number of hydrogen bonds which hold
the anti-parallel strands of a tag together after digestion with the punctuating
restriction endonuclease.

"GC rich" as used herein, means a sequence in which the percentage of 
bases which are G or C is more than 40%, preferably more than 50%. 
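
The "GC rich" criterion is a simple proportion and can be checked as follows. This is an illustrative Python sketch; the function names are hypothetical, and the 40% default mirrors the threshold stated in the definition above.

```python
def gc_fraction(seq):
    """Fraction of bases in `seq` that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def is_gc_rich(seq, threshold=0.40):
    """True when more than `threshold` of the bases are G or C."""
    return gc_fraction(seq) > threshold


print(gc_fraction("ATGC"))   # half of the bases are G or C
print(is_gc_rich("AATTG"))   # one G in five bases
```

Passing `threshold=0.50` applies the stricter, preferred criterion.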

"Correspond," as used herein, means that at least a portion of one nucleic 
acid molecule is either complementary to or identical to a second nucleic acid 
molecule. Thus, a cDNA molecule may con-espond to the mRNA molecule where 
the mRNA molecule was used as a template for reverse transcription to produce 
the cDNA molecule. Similarly, a genomic sequence of a gene may conrespond to 
a cDNA sequence where portions of the genomic sequence are complementary 
or identical to the cDNA sequence. 

"Hybridize" as used herein, means the iformation of a base-paired 
interaction between nucleotide polymers. The presence of base pairing implies 
that a fraction of the nucleotides (e.g., at least 80% of a group of adjacent bases 
in a nucleotide) in each of two nucleotide sequences are complementary to the 
other according to the commonly accepted base pairing rules. The exact fraction 
of the nucleotides which must be complementary in order to obtain stable 
hybridization will vary with a number of factors, including nucleotide sequence, salt 
concentration of the solution, temperature, and pH. 
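
The fraction of complementary bases referred to above can be computed for two equal-length DNA strands aligned antiparallel. The following Python sketch is illustrative only: it handles DNA bases A, C, G and T, ignores gaps and RNA, and says nothing about whether a given fraction yields stable hybridization under particular conditions. All names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5'->3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def complementary_fraction(a, b):
    """Fraction of positions at which strand `a` pairs with strand `b`
    when the two equal-length strands (both written 5'->3') are aligned
    antiparallel."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    target = reverse_complement(b)  # what `a` would be if fully paired
    matches = sum(x == y for x, y in zip(a.upper(), target))
    return matches / len(a)


probe = "GATTACAGAT"
print(complementary_fraction(probe, reverse_complement(probe)))  # fully paired
print(complementary_fraction(probe, "ATCTGTAATG"))               # one mismatch in ten
```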

"Stringent conditions" as used herein, means conditions in which stable 
hybridization of complementary oligonucleotides is maintained, but mismatches are 
not (Sambrook et al., Molecular Cloning (1989), see for example, 11.46; RNA 
hybrids and 9.51, RNA:DNA hybrids. Preferably, stringent condition means 
incubation at 25-65°C in 1-6X SSC. More preferably, stringent condition means 
incubation at 42-65°C in 4-6X SSC. 

"A probe" as used herein, means an oligonucleotide or a vector containing 
a tag or tag-derived (that is, derived from all or part of a tag) sequence, used to 
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hybridize to a pool of RNA or DNA and detect nucleic acids of interest by any of 
a variety of methods known to those skilled in the art. 

"A vector" as used herein, means an agent into which DNA of this invention 
can be inserted by ligation into the DNA of the agent allowing replication of both 
the insert and agent in a suitable host.cell.^ Examples of classes of vectors can be 
plasmids, cosmids, and viruses (e.g., bacteriophage). A cloning vector is used for 
cloning DNA sequences comprising a tag sequence to form a cDNA tag library. 
More preferably, as a cloning vector, pUC18 and pUC19 are used. Preferably, the 
endogenous recognition sites for a punctuating restriction endonuclease within the 
cloning vector have been destroyed by site-directed mutagenesis. A sequencing 
vector is used to clone tags or arrays of tags in preparation for DNA sequence 
analysis. As a sequencing vector. pUC18 and pUC19 are preferred. 

This invention provides a method of obtaining a tag comprising the steps of:

(a) providing a double-stranded cDNA,

(b) cleaving the double-stranded cDNA with a punctuating
restriction endonuclease providing a cDNA fragment,

(c) ligating to the cDNA fragment a 5' adapter which is blunt or
preferably contains a single-stranded overhang compatible with the
punctuating restriction endonuclease. Such ligation produces a first cDNA
construct in which the recognition sequence for the punctuating restriction
endonuclease is regenerated. The 5' adapter also contains a recognition
sequence for a type IIs restriction endonuclease which allows DNA
cleavage at a site in the cDNA fragment distant from the recognition
sequence for the type IIs restriction endonuclease,

(d) cleaving the first cDNA construct with the type IIs restriction
endonuclease providing a second cDNA construct. Preferably, this construct
has 10-14 base-pairs of cDNA-derived sequence and is flanked at its 3' end
by a random single-stranded overhang and at its 5' end by the recognition
sequence of the punctuating enzyme as well as additional sequence
derived from the 5' adapter,

(e) ligating to the second cDNA construct a 3' adapter. Where the
adapter has overhangs, it contains degenerate single-stranded overhangs
compatible with all possible overhangs present in the second cDNA construct.
The 3' adapter also contains a recognition sequence for the punctuating
restriction endonuclease which is preferably located immediately proximal
to the degenerate single-stranded end, if used. Hence, ligation of the 3'
adapter to the second cDNA construct generates a cDNA construct in which a
cDNA-derived sequence of 10-14 bases is flanked at both ends by the
recognition sequence for the punctuating restriction endonuclease, as well
as additional sequence located at either end, providing a third cDNA
construct,

(f) digesting the third cDNA construct with a 5' and 3' cloning
restriction endonuclease to provide a fourth cDNA construct which is ligated
into a like digested cloning vector ("like digested" means digested to provide
ends which can be ligated to the other ends of the vector or construct) and
amplified by growth in a suitable host to form a tag library,

(g) optionally isolating vector DNA from the tag library and
digesting the DNA with the punctuating restriction endonuclease to release
the tag from the vector, and

(h) determining the nucleotide sequences of the tag(s).

Preferably, this invention provides a method of obtaining an array of tags,
comprising the steps of:

(a) providing double-stranded cDNA from an mRNA using a
biotinylated oligo dT primer,

(b) cleaving the double-stranded cDNA with a punctuating
restriction endonuclease which cleaves within the cDNA
providing a population of cDNA fragments,

(c) ligating to the cDNA fragment a 5' adapter comprising,
i) a single-stranded end compatible with ends produced
by cleavage with the punctuating endonuclease;
ii) a recognition sequence for a type IIs restriction
endonuclease located 5' to the single-stranded end;
iii) a recognition sequence for a 5' cloning restriction
endonuclease located 5' to a recognition sequence for the type IIs
restriction endonuclease, providing a first cDNA construct,

(d) isolating the first cDNA construct by affinity capture (such as
chromatography, loose beads, magnetic media or the like) on preferably
solid phase streptavidin,

(e) cleaving the first cDNA construct with the type IIs restriction
endonuclease from the solid phase and/or streptavidin providing a second
cDNA construct,

(f) ligating to the second cDNA construct a 3' adapter comprising,
i) a degenerate single-stranded end compatible with any
ends produced by the type IIs restriction endonuclease;
ii) a recognition sequence for the punctuating
endonuclease located 3' to the degenerate end;
iii) a recognition sequence for a 3' cloning restriction
endonuclease located 3' to a recognition sequence for the
punctuating endonuclease, providing a third cDNA construct,

(g) digesting the third cDNA construct with the 5' and 3' cloning
restriction endonucleases providing a fourth cDNA construct,

(h) inserting the fourth cDNA construct into a cloning vector
digested with the 5' and 3' cloning restriction endonucleases,

(i) replicating the vector DNA in a suitable host strain,

(j) isolating the vector DNA,

(k) digesting the vector DNA with the punctuating endonuclease
providing a tag comprising cDNA sequences and GC rich clamps,

(l) ligating the tags providing arrays of tags comprising at least
10 tags and GC rich clamps,

(m) optionally inserting the arrays of tags into a sequencing
vector,

(n) optionally determining the nucleotide sequences of the arrays
of tags.
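
Once tag sequences have been read from the arrays, the relative abundance of each transcript follows from how often its tag is observed. The following Python sketch shows one straightforward way to tally this. It is an assumption-laden illustration rather than the applicant's procedure: tags are taken as already extracted, oriented strings, and the function name and example data are hypothetical.

```python
from collections import Counter


def tag_frequencies(tags):
    """Map each distinct tag to its share of all observed tags."""
    counts = Counter(tag.upper() for tag in tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


# Hypothetical observations: one tag seen three times, another once.
observed = ["AAACGTTGCATG", "AAACGTTGCATG", "GGGTACCATTGA", "AAACGTTGCATG"]
for tag, share in tag_frequencies(observed).items():
    print(tag, share)
```

Computing the same table for two mRNA populations and comparing the shares tag by tag gives a simple view of differences in transcription between them.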

Double-stranded cDNA is prepared from the target mRNA pool by standard
methods using oligo-dT primer. The oligo-dT primer is preferably biotinylated.
Preferably, the double-stranded cDNA is treated with methylase or other protection
means for a 5'-cloning restriction endonuclease and/or a 3'-cloning restriction
endonuclease to protect any internal endonuclease recognition sites.

Double-stranded cDNA is cleaved with a punctuating endonuclease under
any known conditions providing a cDNA fragment. The punctuating restriction
endonuclease cleaves within the cDNA fragment.

A synthetic, double-stranded adapter molecule with a single-stranded
overhang compatible with the punctuating restriction endonuclease is ligated to
the cDNA fragment. The 3'-most cDNA fragment is then isolated by affinity
capture, preferably such as biotin and streptavidin, on solid phase which is
extensively washed to remove free 5' adapter, providing a first cDNA construct.
The 5' adapter introduces a recognition sequence for a type IIs restriction
endonuclease, preferably BsgI, immediately 5' to the ligated cDNA fragment.
Preferably, the 5' adapter contains a recognition sequence for a 5' cloning
restriction endonuclease at its 5' terminus to facilitate later cloning.

Cleavage of the adapter-ligated, preferably solid-phase bound cDNA
fragment with the type IIs restriction endonuclease releases into the solution phase
a linear DNA fragment consisting of the adapter itself and additional nucleotides
of unknown cDNA sequence separated from the adapter by the punctuation
sequence, providing a second cDNA construct.
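
As an illustration only, and not part of the disclosed method, the position of the tag relative to the punctuation sequence can be modelled in silico. The Python sketch below assumes that the punctuating endonuclease is MspI (recognition sequence CCGG), that the cDNA is supplied as its sense strand, and that the type IIs enzyme leaves a tag of `tag_len` bases; the function name, the default tag length and the example sequence are hypothetical.

```python
def extract_tag(cdna_sense, punctuation="CCGG", tag_len=12):
    """Return the bases immediately 3' of the 3'-most punctuation site.

    Returns None when the sequence has no punctuation site or when fewer
    than tag_len bases follow the site.
    """
    site = cdna_sense.rfind(punctuation)  # 3'-most occurrence of the site
    if site == -1:
        return None  # a transcript without the site yields no tag
    start = site + len(punctuation)
    tag = cdna_sense[start:start + tag_len]
    return tag if len(tag) == tag_len else None


# Hypothetical cDNA with two CCGG sites; only the 3'-most one defines the tag.
example = "ATGCCGGTTTACCGGACGTACGTTAGCATTGCAAAAAAAA"
print(extract_tag(example))
```

Taking the 3'-most site mirrors the affinity capture of the 3'-most cDNA fragment, so each transcript contributes at most one tag in this model.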

A second cDNA construct is then ligated to a 3' adapter molecule which, if
an overhang, such as a two base overhang, is used, would have a 16-fold
degenerate overhang at the 5' end of the 3' adapter which renders it compatible
with all possible cDNA overhang sequences released by the type IIs restriction
endonuclease, providing a third cDNA construct. Preferably, the 3' adapter
contains a recognition sequence for a 3'-cloning restriction endonuclease. This
adapter introduces a recognition sequence for the punctuating restriction
endonuclease to the 3' end of the second cDNA construct, such that the construct
contains a cDNA-derived "tag" sequence flanked at both ends by punctuation
sequence produced by a 5' and a 3' adapter.
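
The 16-fold degeneracy follows from simple counting: a two base overhang can take any of 4 x 4 = 16 sequences. The short Python sketch below enumerates them; the function name is hypothetical and the code is offered only to make the arithmetic concrete.

```python
from itertools import product


def degenerate_overhangs(length=2, alphabet="ACGT"):
    """List every possible single-stranded overhang of the given length."""
    return ["".join(bases) for bases in product(alphabet, repeat=length)]


overhangs = degenerate_overhangs(2)
print(len(overhangs), overhangs)
```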

The third cDNA construct is digested with the punctuating endonuclease
under conditions known to persons skilled in the art, thus providing a tag. Digestion
with the punctuating endonuclease provides a tag which comprises cDNA
sequences with a recognition sequence for a punctuating endonuclease at its
ends. The resulting tag can be inserted, for example, into a cloning vector to
amplify in microorganisms. After amplification, the tag sequence is determined by
any known method.

Alternatively, instead of digesting a third cDNA construct with the
punctuating restriction endonuclease, the resulting third cDNA construct is
digested with the 5' and 3'-cloning restriction endonucleases providing a fourth
cDNA construct. The third cDNA construct can be digested initially with either of
5' or 3' restriction endonuclease or digested with both restriction endonucleases
simultaneously. In this case the fourth cDNA construct is isolated by any known
method including gel electrophoresis or the like, to resolve it from dimers of the
adapters which are also formed in the ligation reaction. These manipulations result
in a 5' and 3'-cloning restriction endonuclease-tailed DNA fragment containing a
cDNA tag flanked at both ends by the punctuation sequence. The resulting fourth
cDNA construct is isolated from the isolation means, or resolving means by known
methods, such as eluting from a gel and recovery by ethanol precipitation.

Before inserting the fourth cDNA construct into a cloning vector (digested
with the 5' and 3'-cloning restriction endonucleases), it is preferred that any
endogenous punctuating endonuclease restriction sites in the vector have been
removed by site-directed mutagenesis. As a cloning vector, pUC18 or pUC19 is
preferably used. The cloning vector is digested with the 5' and 3'-cloning restriction
endonucleases and the fourth cDNA construct is inserted into the cloning vector.

The cloning vectors are replicated using any method known in the art.
Preferably, the cloning vector comprising the cDNA construct is amplified in a host cell
such as, but not limited to, E. coli by first transforming E. coli with the vector
comprising the cDNA construct, growing the transformed cells, and isolating the
cloning vector from the cell culture.

Preferably, cultured host cells are collected by centrifugation and then 
plasmid DNA is prepared from the precipitate using any known procedures to 
isolate the vector DNA. 

The plasmid DNA is digested with the punctuating restriction endonuclease
to release the tags. Each tag is a DNA fragment consisting of a 10-14 base-pair
sequence derived from the cDNA. The resulting tag is flanked at both ends,
preferably by compatible single-stranded overhangs, which are derived from
the recognition sequence for a punctuating endonuclease. When the punctuating
endonuclease is MspI, tags have GC single-stranded 3' overhang and CG
single-stranded 5' overhangs. A GC clamp prevents the melting of tags at ambient
temperatures and attendant bias against AT-rich sequences. The tag fragments
are isolated away from the plasmid backbone by acrylamide gel electrophoresis,
eluted from the gel and recovered by ethanol precipitation in preparation.

Tags are ligated together via their compatible ends to form arrays of tags.
The arrays of tags are isolated by agarose gel electrophoresis.

Arrays of tags are inserted into a sequencing vector in preparation for DNA
sequence analysis. As a sequencing vector, any vector is useful. Preferably,
pGEM® (Promega Corp., Madison, WI), pBluescript® (Stratagene, La Jolla, CA),
pUC18 or pUC19 are used. More preferably, pUC19 can be used. Each array
consists of, preferably, 10-14-base pair tag sequences separated from each other
and from the plasmid backbone by the defined 4-base punctuation sequence.
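
Because every tag is bounded by the same 4-base punctuation sequence, a sequenced array can be split back into tags with a simple string operation. The Python sketch below is illustrative only: it assumes the punctuation sequence is CCGG (the MspI site), accepts only pieces of 10-14 bases, and uses a hypothetical function name and made-up tag sequences.

```python
def split_array(array_seq, punctuation="CCGG", min_len=10, max_len=14):
    """Split a sequenced array into tags at each punctuation sequence.

    Pieces outside the expected tag length are discarded as ambiguous.
    Note that tags may be ligated in either orientation, so a downstream
    search would normally also consider each tag's reverse complement.
    """
    pieces = array_seq.split(punctuation)
    return [piece for piece in pieces if min_len <= len(piece) <= max_len]


# Hypothetical array of three tags (12, 10 and 14 bases long).
array_seq = ("CCGG" + "ACGTACGTTAGC" + "CCGG" + "TTGACCATGA"
             + "CCGG" + "GATTACAGATTACA" + "CCGG")
print(split_array(array_seq))
```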

Any known procedures are used for sequencing analysis to determine the
nucleotide sequences of the tag or the arrays of tags.

This method allows mRNAs with a number of copies to be detected in a
given cell population. By comparing gene transcription profiles among cells, this
method can be used to identify individual genes whose transcription is associated
with a pathological phenotype.

Using high throughput DNA sequencing, the method of this invention also 
permits the generation of a global gene transcription profile. Thus, this invention 
provides a simple and rapid method of obtaining sufficient data to use in an 
information system known to those of skill in the art to obtain a global gene 
transcription profile and identify genes of interest. 

Accordingly, this invention can be used to identify differential gene 
transcription patterns among two or more cells or tissues. Thus, using the methods
of this invention one can identify a gene or genes that are transcribed in any given 
cell type, tissue, or target organism at a different level from that in another cell 
type, tissue, or target organism. 

The methods of this invention can be used to identify differential gene
transcription patterns at different stages of development in the same cell-type or
tissue-type, and to identify changes in gene transcription patterns in diseased or
abnormal cells. Further, this invention can be used to detect changes in gene
transcription patterns due to changes in environmental conditions or to treatment
with drugs. To do so, patterns of gene transcription are compared using
double-stranded cDNA obtained from different mRNA populations of interest.

This invention also provides a method of identifying patterns of gene
transcription, comprising steps of:

(a) providing tags from sources of interest according to this invention, and

(b) identifying patterns of gene transcription.

Tags are prepared from an mRNA population of interest according to this 
invention and their sequences are determined using conventional procedures. 

Sequences of resulting tags are compared to known 
sequence databases to identify patterns of gene transcription. 

This invention also provides a method of detecting a difference in gene 
transcription between two or more mRNA populations, comprising steps of: 

(a) identifying patterns of gene transcription from a first mRNA
population according to this invention,

(b) identifying patterns of gene transcription from a second mRNA
population according to this invention, and

(c) comparing the patterns of gene transcription from (a) and (b).

Preferably, the first mRNA population is obtained from a normal cell or
tissue. Patterns of gene transcription from a first mRNA population are then
identified. Preferably, the second mRNA population and/or any additional mRNA
population is obtained from a target organism having a disease or disorder, cells
or tissues at different developmental stages, different tissues or organs of the
same target organism or different target organisms and patterns of gene
transcription are identified.

The patterns obtained from the first and second mRNA populations are
compared and the difference is observed. In addition, patterns from other mRNA
populations can be compared to those initially derived. This method is also useful
in identifying genes modulated by development, disorders, drugs, stress, disease
or the like.
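
One way such a comparison might be carried out computationally is sketched below in Python. This is an assumption-laden illustration rather than a procedure taken from this disclosure: each pattern is represented as a plain list of tag sequences, tag frequencies are normalised by library size, and a pseudocount (a hypothetical parameter) avoids division by zero for tags absent from one population.

```python
from collections import Counter


def compare_profiles(tags_first, tags_second, pseudocount=1.0):
    """Return, per tag, the ratio of its frequency in the second
    population to its frequency in the first population."""
    counts_first, counts_second = Counter(tags_first), Counter(tags_second)
    total_first, total_second = len(tags_first), len(tags_second)
    ratios = {}
    for tag in set(counts_first) | set(counts_second):
        freq_first = (counts_first[tag] + pseudocount) / total_first
        freq_second = (counts_second[tag] + pseudocount) / total_second
        ratios[tag] = freq_second / freq_first
    return ratios


# Hypothetical tag lists from a normal and a diseased tissue.
normal = ["ACGTACGTTAGC"] * 8 + ["TTGACCATGA"] * 2
diseased = ["ACGTACGTTAGC"] * 2 + ["TTGACCATGA"] * 8
for tag, ratio in sorted(compare_profiles(normal, diseased).items()):
    print(tag, round(ratio, 2))
```

Ratios well above or below 1 flag tags whose genes are candidates for differential transcription; any statistical threshold would have to be chosen by the user.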

This invention also provides a method of determining the relative frequency
of a particular gene's transcription compared to other genes transcribed into an
mRNA population comprising the steps of:

(a) providing an array of tags of interest,

(b) sequencing the array of tags, and

(c) determining relative frequency of any or all tags.
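
Step (c) amounts to counting how often each tag occurs and dividing by the total number of tags. A minimal Python sketch follows, assuming the sequenced tags are already available as a list of strings; the function name and tag sequences are hypothetical.

```python
from collections import Counter


def relative_frequencies(tags):
    """Map each distinct tag to its share of all tags observed."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}


observed = ["ACGTACGTTAGC"] * 3 + ["TTGACCATGA"]
print(relative_frequencies(observed))
```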

An array of tags is prepared based on this invention from a cDNA library from
an mRNA population.

The array of tags can be sequenced by using any known methods, such as
the sequencing by hybridization method (see for example US Patent 5,202,231,
hereby incorporated by reference).

Once sequencing of tags is accomplished, determining the frequency of
tags is done by any method available. For example, this can be done manually or
using a suitable algorithm and/or a computer searchable database.

This invention also provides a method of screening for a disease, a
disorder, or the like stress as defined above, including the effects of a drug on a
cell or tissue comprising the steps of:

(a) identifying patterns of gene transcription in the normal cell according
to this invention,

(b) identifying patterns of gene transcription in the presence of a stress
or the like according to this invention, and

(c) comparing the patterns of gene transcription from (a) and (b).

For example, the differences in the patterns of gene transcription between
cells cultured with the drug and those without the drug are compared to determine
whether the drug changes the gene transcription profile. This method yields
information on (1) markers useful in diagnosis of disease or other stress as defined
above, such as by blood test or the like, and/or (2) determining target enzymes or
proteins for treatment and thus providing an aid in drug design or development.

This invention also provides a method of detecting the presence of a
disease, or other stress as defined above in a target organism comprising the
steps of:

(a) providing the tag sequence of a gene that is differentially expressed
(either expressed more abundantly, i.e. increased expression, or expressed less
abundantly, i.e. decreased expression; that is, the disease or other stress as defined
above modulates expression in some way) in a normal cell or tissue according to
this invention,

(b) hybridizing a cDNA library obtained from a first target organism with
the tag,

(c) hybridizing a cDNA library obtained from a second normal or
diseased (or affected by other stress as defined above) target organism with the
tag sequence, and

(d) comparing the level of transcription of the gene in the first target
organism with the level of transcription of the gene in the second target.

Any known methods are employed to detect the presence of a disease or other
stress as defined above in the first target organism to compare the level of
transcription of the gene in the first target organism with the level of transcription
of the gene in the second target organism.

This invention also provides a method of isolating a gene, comprising the
steps of:

(a) providing a probe comprising a tag sequence of interest according
to this invention,

(b) probing a cDNA library of interest, and

(c) isolating a gene.

Any tag of interest can be used to provide a probe comprising a tag
sequence of interest. This probe can be used to find a new gene, detect a gene
in a cell, or detect a mutation. A probe comprising a tag sequence is prepared
using a synthetic oligonucleotide or a vector comprising a tag sequence. A probe
is preferably labeled for detection by radioisotope, fluorescence and the like, or for
isolation such as by biotin, streptavidin or the like.

The nucleotide sequence of a tag is compared with known nucleotide
sequences to determine which gene to isolate. Known nucleotide sequences can
be obtained from any source using sequence databases, such as GenBank, etc.
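
The comparison of a tag with known sequences can be pictured as a substring search on both strands. The Python sketch below is a toy illustration under stated assumptions: the known sequences are held in an in-memory dictionary with hypothetical names, whereas a real search would query a database such as GenBank with dedicated tools.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_matches(tag, known_sequences):
    """Return names of known sequences containing the tag on either strand."""
    tag_rc = reverse_complement(tag)
    return [name for name, seq in known_sequences.items()
            if tag in seq or tag_rc in seq]


# Hypothetical records standing in for database entries.
known = {
    "geneA": "GGGACGTACGTTAGCTTT",   # contains the tag itself
    "geneB": "AAAGCTAACGTACGTCCC",   # contains its reverse complement
    "geneC": "TTTTTTTTTTTTTTTT",     # no match
}
print(find_matches("ACGTACGTTAGC", known))
```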

This invention also provides a kit for obtaining a tag or an array of tags
comprising:

(a) a 5' adapter,

(b) a 3' adapter,

(c) appropriate vectors, including cloning and/or sequencing vectors,
and

(d) appropriate restriction endonucleases, as disclosed herein.

The kit can further include a reaction buffer
and/or a cDNA library and other such components.

Hence, this invention provides a rapid and accurate means to quantitatively
analyze the gene transcription profile of interest and to compare profiles between
different sources for a host of reasons.

The method also assists in new gene discovery because the 10-14-bp tag
sequences generated by this invention can serve as hybridization probes to
facilitate the isolation of interesting tagged genes whose function is not yet known.
Isolation of genes using such tags is well understood in the art.

EXAMPLES

To assist in understanding the present invention, the following Examples
are included which describe the results of a series of experiments. The
experiments relating to this invention should not, of course, be construed as
specifically limiting the invention and such variations of the invention, now known
or later developed, which would be within the purview of one skilled in the art are
considered to fall within the scope of the invention as described herein and
hereinafter claimed.

Trademarks used herein are examples only and reflect illustrative materials
used at the time of the invention. The skilled artisan will recognize that variations
in lot, manufacturing processes, and the like, are expected. Hence the examples,
and the trademarks used in them are non-limiting, and they are not intended to be
limiting, but are merely an illustration of how a skilled artisan may choose to
perform one or more of the embodiments of the invention.

EXAMPLE 1 PREPARATION OF A cDNA TAG LIBRARY
(a) cDNA SYNTHESIS

Ten µg of polyA+ RNA from normal adult human lung (Clontech, Inc., Palo
Alto, CA) was primed with 5' biotinylated oligo dT(25) and copied into cDNA
(Superscript II® cDNA synthesis kit from GIBCO Life Technologies, Gaithersburg,
MD). The cDNA was phenol/chloroform extracted and ethanol precipitated. The
pellet was dissolved in 50 µl of buffer (1X NEB buffer #2 from New England Biolabs
(NEB), Tozer, MA) containing 200 units of MspI (NEB) and digested for 1 hour at
37°C followed by 20 minutes at 65°C to inactivate the enzyme.

The reaction mix was brought to 800 µl in magnetic bead binding buffer,
added to 600 µl of streptavidin magnetic beads (Dynal, Inc., Lake Success, NY),
which were previously washed and from which all wash buffer was removed. The
total volume of this solution was 800 µl. This solution was incubated overnight at
room temperature with gentle rotation. Beads were washed 5 times using a
magnetic capture device with 10 mM Tris-HCl, pH 7.6 followed by a final wash in
0.5X T4 ligase buffer (GIBCO).

(b) ADAPTER LIGATION AND LIBRARY GENERATION
Oligonucleotide adapters (5' adapter) were created by mixing synthetic
oligos, heating them to 95°C, and allowing them to cool slowly to room
temperature. Sense strand of this oligonucleotide adapter has a sequence of SEQ
ID NO. 5 and antisense strand of this oligonucleotide adapter has a sequence of
SEQ ID NO. 6. 7.6 µg of 5' adapter DNA was added to the magnetic bead mix
in a volume of 100 µl of 0.5X T4 ligase buffer containing 50 units of T4 ligase
(GIBCO) and incubated 2 hours at room temperature. Beads were then washed
5 times in 10 ml of 10 mM Tris-HCl, pH 7.6. The beads were resuspended in 4
aliquots of 500 µl each and incubated at 65°C for 20 minutes. The beads were then
washed 5 times in 10 ml of 10 mM Tris-HCl, pH 7.6. Beads were then
resuspended in 360 µl of NEB buffer 4 containing 80 µM SAM
(S-adenosylmethionine).

The beads were divided into 4 aliquots of 90 µl each and 40 units BsgI
(NEB) was added to each aliquot. The beads were then incubated at 37°C for 1.5
hours. The tag-containing supernatants were pooled. This solution was extracted
with phenol/chloroform and ethanol precipitated with 20 µg of mussel glycogen
carrier.

A second, 16-fold degenerate adapter (3' adapter) molecule was prepared
by annealing synthetic oligos. Sense strand of this 16-fold degenerate adapter has
a sequence of SEQ ID NO. 7 and antisense strand of this 16-fold degenerate
adapter has a sequence of SEQ ID NO. 8.

7.9 µg of 3' adapter was added to the tag DNA pellet in a total volume of 30
µl of ligase buffer containing 10 units T4 DNA ligase and incubated for 2 hours at
room temperature. The ligase mix was heat inactivated by incubation at 65°C for
20 minutes.

The reaction mix was loaded on a 15% TAE (1X TAE: 40 mM Tris-Acetate,
1 mM EDTA) acrylamide gel to resolve the 68 bp fragments containing cDNA tags
from multimers of the adapter. cDNA tag fragments were excised from the ethidium
bromide (EtBr)-stained gel, eluted overnight in 500 µl of TE (10 mM Tris, 1 mM
EDTA), and ethanol precipitated with 20 µg mussel glycogen carrier. Tag DNA
was quantified by spectrophotometer and ligated overnight at 16°C, at a 3:1
insert:vector ratio, into the dephosphorylated EcoRI and NotI sites of a pUC19
vector (Bayou Biolabs, Harahan, LA) in which endogenous MspI sites were
destroyed by site-directed mutagenesis. The ligation mix was transformed into
competent cells (XL-10 Gold® from Stratagene, La Jolla, CA) which were grown in
5 L of LB medium containing 100 µg/ml ampicillin.
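
For readers who wish to translate an insert:vector molar ratio into masses, the calculation can be sketched as follows in Python. The sketch assumes the ratio is molar, takes the vector length as 2686 bp (the length of unmodified pUC19) and uses a vector mass of 50 ng purely as a hypothetical input; this Example does not state the vector mass used in the ligation.

```python
def insert_mass_ng(vector_ng, vector_bp, insert_bp, insert_to_vector=3.0):
    """Mass of insert (ng) giving the requested insert:vector molar ratio.

    For double-stranded DNA, moles are proportional to mass / length, so
    insert_ng = ratio * vector_ng * insert_bp / vector_bp.
    """
    return insert_to_vector * vector_ng * insert_bp / vector_bp


# 68 bp insert, hypothetical 50 ng of a 2686 bp vector, 3:1 ratio.
print(round(insert_mass_ng(50, 2686, 68), 2), "ng of insert")
```

Because the insert is so much shorter than the vector, only a few nanograms of tag fragment are needed per 50 ng of vector under these assumptions.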

(c) TAG ISOLATION AND CONCATEMER FORMATION
Transformed bacteria were recovered by centrifugation and plasmid DNA
was isolated (Mega Plasmid Prep Kit from Qiagen, Inc., Valencia, CA). One mg
of plasmid DNA was digested with 5000 units MspI in a total volume of 1 ml at
20°C for 1 hour. This reaction was loaded into 12 lanes of a 15% TAE acrylamide
gel. Tag fragments were excised from the EtBr-stained gel, eluted overnight in
approximately 1 ml of TE and ethanol precipitated with 20 µg mussel glycogen
carrier.

Purified tags were resuspended in 20 µl ligase buffer containing 10 units T4
ligase, and incubated at room temperature for 1.5 hours. This was followed by an
additional 20 minute incubation with 5 units of Klenow fragment and 2 mM dNTPs.
The reaction was then loaded onto a 1% TAE agarose gel containing EtBr and
electrophoresed. Nucleic acids of 600-1200 bp in length were isolated from the
gel, ethanol precipitated and resuspended in 10 µl of ligation buffer containing 50
ng SmaI-cut, alkaline phosphatase-treated pUC19, and 5 units of T4 ligase, and
incubated overnight at 16°C. One µl of the ligation mix was transformed into 100
µl of competent cells (XL-10 Gold® from Stratagene). The reaction mix was plated
on LB-Amp plates with IPTG-Xgal and individual white colonies were picked for
DNA sequence analysis.
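
As a rough guide to what the 600-1200 bp size selection implies, the number of tags in an array can be estimated by dividing the array length by the length of one tag plus one 4-base punctuation sequence. The Python sketch below makes that estimate; it ignores the extra punctuation sequence at the array end and any vector-derived bases, and the function name is hypothetical.

```python
def tags_per_array(array_bp, tag_bp, punctuation_bp=4):
    """Approximate number of tags that fit in an array of the given length."""
    return array_bp // (tag_bp + punctuation_bp)


# Shortest selected arrays with the longest tags, and the reverse case.
print(tags_per_array(600, 14), tags_per_array(1200, 10))
```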

(d) DNA SEQUENCING
Arrays of ~600 bp or greater were purified by agarose gel electrophoresis
and cloned into pUC19. Sequencing-grade plasmid templates from about 700
independent colonies were prepared in 96-well plates (Qiagen BioRobot 9600).
Templates were subjected to cycle sequencing (PE-ABI BigDye™ terminator
chemistry from PE Applied Biosystems, Foster City, CA) with the M13 reverse
primer. Reactions were run on automated sequencers (ABI 377 automated
sequencers). Extracted data were ported to a Sybase database and subjected to
automated DNA sequence analysis (BioLIMS® Sequencing Analysis, ver. 3.1.3
system from PE-ABD).

A frequency distribution of tags was generated and searched against a
database to generate a small transcription profile. The profile contained 14,496
unambiguous tag sequences representing 2560 independent genes. The number
of identifiable tags in each array ranged from 2 to 63 with an average of 32 tags
per array.
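
The figures above can be cross-checked with simple arithmetic. The Python sketch below derives two quantities that are implied by, but not stated in, this Example: the approximate number of arrays that contributed tags (assuming the stated average applies to all of them) and the mean number of tags observed per gene.

```python
total_tags = 14496          # unambiguous tag sequences in the profile
independent_genes = 2560    # genes represented by those tags
mean_tags_per_array = 32    # stated average of identifiable tags per array

# Implied number of arrays contributing tags, under the stated average.
arrays_implied = total_tags / mean_tags_per_array

# Mean number of tag observations per represented gene.
tags_per_gene = total_tags / independent_genes

print(arrays_implied, tags_per_gene)
```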

Most of the abundant tags corresponding to identifiable genes in this profile
have been previously described as being highly expressed either in the lung (Itoh,
K. et al., DNA Res. 1: 279 (1994)) or in most human tissues (Adams, M.D. et al.,
Nature 377: (Suppl.) 3 (1995)).

EXAMPLE 2 PREPARATION OF cDNA TAG LIBRARY (2)
(a) cDNA SYNTHESIS
Ten µg of polyA+ RNA from normal adult human lung (Clontech, Inc.) was
primed with 5' biotinylated oligo dT(25) and copied into cDNA (Superscript II®
cDNA synthesis kit from GIBCO). The cDNA was ethanol precipitated. The pellet
was dissolved in 50 µl of EcoRI Methylation Buffer (NEB) containing 80 µM
S-adenosylmethionine (SAM) (NEB) and 320 units of EcoRI Methylase (NEB) and
incubated for 1 hour at 37°C followed by 15 minutes at 65°C to inactivate the
enzyme. The reaction mix was brought to 800 µl with water and spun through a
spin filter (Microcon 50, Amicon Inc., Beverly, MA) until 60 µl was retained.

MspI digestion was performed in 50 µl of buffer (1X NEB buffer #2)
containing 10 mM MgCl2 and 400 units of MspI (NEB) and digested for 2 hours at
37°C followed by 20 minutes at 65°C to inactivate the enzyme. The reaction mix
was brought to 880 µl with water and spun through a spin filter (Microcon 30,
Amicon) until about 180 µl was retained.

(b) ADAPTER LIGATION AND LIBRARY GENERATION 
Oligonucleotide adapters (5' adapter) were created by mixing synthetic 
oligos, heating them to 95°C, and allowing them to cool slowly to room 
temperature. Sense strand of this oligonucleotide adapter has a sequence of SEQ 
ID NO. 1 and antisense strand of this oligonucleotide adapter has a sequence of 
SEQ ID NO. 2. 

54 µg of 5' adapter DNA was added to the MspI-digested cDNA in a volume
of 150 µl of 1X T4 DNA ligase buffer containing 37.5 units of T4 DNA ligase
(GIBCO) and incubated 2 hours at room temperature. 150 µl of 10 mM Tris, 1 mM
EDTA, 1 M NaCl was added to the reaction followed by 20 minutes at 65°C to
inactivate the enzyme. The 300 µl reaction mix was added to 900 µl of streptavidin
magnetic beads (Dynal, Inc.), which were previously washed and from which all
wash buffer had been removed. This solution was incubated overnight at room
temperature with gentle rotation. Beads were washed 5 times in 10 ml each of 10
mM Tris-HCl, pH 7.6, using a magnetic capture device. The beads were
suspended in 2 ml of 5 mM Tris, pH 7.6, with 250 mM NaCl and incubated at 65°C for
20 minutes to further inactivate the enzyme. The beads were then washed 5 times
in 10 ml each of 10 mM Tris-HCl, pH 7.6. Beads were then washed once with 1 ml
of 1X NEB buffer 4 containing 80 µM SAM.

The beads were suspended in 200 µl of 1X NEB buffer 4 containing 80 µM
SAM and 100 units of BsgI (NEB). The beads were incubated at 37°C for 2 hours
with gentle rotation. The tag-containing supernatant was saved and the beads
were washed two times with 200 µl of 10 mM Tris, pH 7.6. The washes and
supernatant were pooled and 30 µl of 5 M NaCl was added. The pooled supernatant
and washes were then incubated at 65°C for 20 minutes to inactivate the enzyme.
This solution was spun over a Microcon 10 (Amicon) spin filter until the volume
was about 30 µl.

A second, 16-fold degenerate adapter molecule (3' adapter) was prepared
by annealing synthetic oligos. Sense strand of 16-fold degenerate adapter has a
sequence of SEQ ID NO. 3 and antisense strand of 16-fold degenerate adapter
has a sequence of SEQ ID NO. 4. The 3' adapter has 1 or more biotinylated
phosphoramidites incorporated on the 3' end of the oligo during oligo synthesis.

10.8 µg of 3' adapter was added to the tag DNA in a total volume of 45 µl
of T4 DNA ligase buffer containing 10 units T4 DNA ligase and incubated for 2
hours at room temperature. The ligase mix was heat inactivated by incubation at
65°C for 20 minutes.

The tag DNA was digested with NotI in a total volume of 250 µl of 1X React 3
buffer (Gibco) containing 1X bovine serum albumin (BSA) and 625 units of NotI
(Gibco) and incubated at 37°C overnight.

The NotI digestion reaction was digested with EcoRI in 500 µl of 1X React
3 buffer with 1250 units of EcoRI and incubated at 37°C for 2 hours followed by
addition of 5 µl of 0.5 M EDTA and 90 µl of 5 M NaCl. The reaction was heat inactivated
by incubation at 65°C for 20 minutes.

The 500 µl reaction mix was added to 300 µl of streptavidin magnetic beads
(Dynal, Inc.), which were previously washed and from which all wash buffer had
been removed. This solution was incubated overnight at room temperature with
gentle rotation.

The supernatant from binding was collected and spun on a spin filter
(Microcon 10, Amicon) until the retained volume was about 30 µl.

The concentrated reaction mix was loaded on a 10% TAE acrylamide gel
to resolve the 93 bp fragments containing cDNA tags from multimers of the
adapter. cDNA tag fragments were excised from the EtBr-stained gel, eluted
overnight in 300 µl of TE, and ethanol precipitated with 15 µg GLYCOBLUE
(Ambion Inc., Austin, TX) carrier. Tag DNA was quantified by spectrophotometer
and ligated overnight at 16°C, at a 1:4 insert:vector ratio, into the
dephosphorylated EcoRI and NotI sites of a pUC19 vector in which endogenous
MspI sites were destroyed by site-directed mutagenesis. The ligation mix was
transformed into competent cells (XL-10 Gold® from Stratagene, La Jolla, CA)
which were grown in 2 L of LB medium containing 100 µg/ml ampicillin.

(c) TAG ISOLATION AND CONCATEMER FORMATION

Transformed bacteria were recovered by centrifugation and plasmid DNA
was isolated (Mega Plasmid Prep Kit from Qiagen, Inc.). One mg of plasmid DNA
was digested with 5000 units MspI in a total volume of 1 ml at 20°C for 1 hour.
This reaction was loaded into 12 lanes of a 15% TAE acrylamide gel. Tag
fragments were excised from the EtBr-stained gel, eluted overnight in
approximately 1 ml of TE and ethanol precipitated with 15 µg GLYCOBLUE
(Ambion) carrier.

Purified tags were resuspended in 20 µl ligase buffer containing 10 units T4
DNA ligase, and incubated at room temperature for 1.5 hours. This was followed
by an additional 20 minute incubation with 5 units of Klenow fragment and 2 mM
dNTPs. The reaction was then loaded onto a 1% TAE agarose gel containing EtBr
and electrophoresed.

Nucleic acids of 600-1200 bp in length were isolated from the gel, ethanol
precipitated and resuspended in 10 µl of ligation buffer containing 50 ng SmaI-cut,
alkaline phosphatase-treated pUC19, and 5 units of T4 DNA ligase, and incubated
overnight at 16°C. One µl of the ligation mix was transformed into 100 µl of
competent cells (XL-10 Gold® from Stratagene). The reaction mix was plated on
LB-Amp plates with isopropyl-1-thio-β-D-galactopyranoside (IPTG, Stratagene)
and 5-bromo-4-chloro-3-indolyl-β-D-galactopyranoside (Xgal, Stratagene) and
individual white colonies were picked for DNA sequence analysis.

(d) DNA SEQUENCING
Arrays of ~600 bp or greater were purified by agarose gel electrophoresis
and cloned into pUC19. Sequencing-grade plasmid templates were prepared in
96-well plates (Qiagen BioRobot 9600). Templates were subjected to cycle
sequencing (PE-ABI BigDye™ terminator chemistry from PE Applied Biosystems,
Foster City, CA) with the M13 reverse primer. Reactions were run on automated
sequencers (ABI 377 automated sequencers). Extracted data were ported to a
Sybase database and subjected to automated DNA sequence analysis (BioLIMS®
Sequencing Analysis, ver. 3.1.3 system from PE-ABD).

EXAMPLE 3 COMPARISON OF A cDNA TAG LIBRARY TO A STANDARD
cDNA LIBRARY

To characterize a tag as corresponding to a gene, we produced a
standard oligo dT-primed cDNA library of about 1,000,000 primary plaques in
lambda-gt10 (Huynh, T.V. et al., DNA Cloning: A Practical Approach, D. Glover,
Ed. (IRL Press, Oxford) (1984)).

A series of replicate nitrocellulose filter lifts were prepared from large plates
containing ~18,000 plaques. These filters were probed with end-labeled
oligonucleotides corresponding to identified tag sequences. The frequencies of
probe hybridization match well with the corresponding tag frequencies in the
profile. Clones hybridizing to the tag were isolated and subjected to DNA
sequence analysis.

This analysis confirmed the identity of the tagged gene and its relative
expressed abundance compared to a cDNA library.

EXAMPLE 4 DETERMINATION OF PCR AMPLIFICATION BASED
METHOD ERRORS ON QUANTITATION

Two pairs of complementary oligonucleotides corresponding to synthetic
amplicons were synthesized and annealed. Each amplicon contains a single di-tag
of 18 bases in length flanked by anchoring enzyme sequences (CATG) and PCR
priming sites exactly as described by Velculescu et al., Science, 270: 484 (1995).
One of the tags in each amplicon is an identical 9-base arbitrary sequence
(CTGTTAGTA). This common tag sequence was paired with either an AT-rich
non-palindromic "tag" sequence (TATAATAAA for amplicon 1) or a GC rich
palindromic sequence (CCCGATCGG for amplicon 2) to form artificial di-tags. The
entire double-stranded synthetic amplicon terminates in single-stranded ends
compatible with BamHI and HindIII digested vectors to facilitate cloning.
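
The contrast between the two variable tags can be made concrete by computing their GC content and their longest self-complementary stretch. The Python sketch below is an illustrative aid, not part of the experiment; the function names are hypothetical. A sequence with an odd number of bases cannot equal its own reverse complement, so the sketch reports the longest self-complementary substring rather than testing the whole 9-base tag.

```python
def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def longest_self_complementary(seq):
    """Longest substring (2 bases or more) equal to its reverse complement."""
    for size in range(len(seq), 1, -1):          # try longest pieces first
        for start in range(len(seq) - size + 1):
            piece = seq[start:start + size]
            if piece == reverse_complement(piece):
                return piece
    return ""


for name, tag in (("amplicon 1", "TATAATAAA"), ("amplicon 2", "CCCGATCGG")):
    print(name, round(gc_fraction(tag), 2), longest_self_complementary(tag))
```

Under this measure the amplicon 2 tag contains an 8-base self-complementary core and is about 78% GC, whereas the amplicon 1 tag contains no G or C at all.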

These artificial amplicons were ligated into a plasmid vector, and a precisely
equivalent concentration of plasmid DNA was prepared from each. This plasmid
DNA was diluted and subjected to either 15 or 20 cycles of amplification using
6-FAM (6-carboxyfluorescein) end-labeled primers and amplification conditions
described in Velculescu et al., Science, 270: 484 (1995).

The intensity of the bands derived from the corresponding PCR products
was not identical. After 15 cycles of amplification, there was no visible PCR
product derived from amplicon 2 while the band derived from amplicon 1 was
clearly evident. When the cycle number was increased to 20, both PCR products
were visible, but peak area analysis (PE Biosystems Prism 377 Genescan™
software from PE, Santa Fe, N.M.) demonstrated that the quantity of product
derived from amplicon 1 was more than 5 times that of amplicon 2, despite the fact
that there was no difference in starting template concentration or PCR primers.

The clear implication is that the amount of PCR product produced after 
amplification can be dramatically influenced by the sequence of any tag within it. 
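
The two properties that the example names for the variable tags, base composition and palindromic character, can be quantified with a short sketch. This is only an illustration: the function names are hypothetical, "palindromic" is interpreted here as "equal to its own reverse complement", and the example itself does not state a mechanism for the amplification difference.

```python
# Illustrative sketch only: helper names are hypothetical and not part of the
# described method. It compares the two variable 9-base tags from Example 4.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_fraction(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def longest_self_complementary(seq: str) -> str:
    """Longest stretch of seq that equals its own reverse complement."""
    seq = seq.upper()
    best = ""
    for start in range(len(seq)):
        for end in range(start + 2, len(seq) + 1):
            window = seq[start:end]
            if len(window) > len(best) and window == reverse_complement(window):
                best = window
    return best


# Variable tags as given in the text for the two synthetic amplicons.
TAGS = {"amplicon 1": "TATAATAAA", "amplicon 2": "CCCGATCGG"}

for name, tag in TAGS.items():
    print(
        f"{name}: {tag}  GC fraction = {gc_fraction(tag):.2f}  "
        f"longest self-complementary stretch = {longest_self_complementary(tag)}"
    )
```

A self-complementary stretch always has even length, so for a 9-base tag the sketch reports the longest such stretch rather than testing the whole tag.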

All references cited herein are hereby incorporated by reference, whether 
specifically stating that they are incorporated by reference or not, as if fully set forth 
for the matter they contain. 

WE CLAIM 

1. A method of obtaining a tag comprising the steps of: 

(a) providing a double-stranded cDNA, 

(b) cleaving said double-stranded cDNA with a punctuating 
restriction endonuclease providing a cDNA fragment, 

(c) ligating to said cDNA fragment a 5' adapter comprising a 
recognition sequence for a type IIs restriction endonuclease which allows 
DNA cleavage at a site in the cDNA fragment distant from the recognition 
sequence for a type IIs restriction endonuclease providing a first cDNA 
construct, 

(d) cleaving said first cDNA construct with said type IIs restriction 
endonuclease providing a second cDNA construct, 

(e) ligating to said second cDNA construct a 3' adapter, thereby 
producing a recognition sequence for a punctuating endonuclease located 
3' adjacent to said second cDNA construct, providing a third cDNA 
construct, 

(f) digesting said third cDNA construct with said punctuating 
restriction endonuclease providing a tag comprising cDNA sequences, and 

(g) determining the nucleotide sequences of said tag. 

2. The method of claim 1, wherein the punctuating restriction 
endonuclease is MspI and the type IIs restriction endonuclease is BsgI. 

3. The method of claim 1 further comprising the step of ligating tags 
providing an array of tags. 

4. The method of claim 3 further comprising the step of introducing said 
array of tags into a vector. 

5. A method of obtaining an array of tags, comprising the steps of: 

(a) providing double-stranded cDNA from an mRNA using a 
biotinylated oligo dT primer, 

(b) cleaving within said double-stranded cDNA with a punctuating 
restriction endonuclease providing a population of cDNA fragments, 

(c) ligating to said cDNA fragment a 5' adapter comprising, 

i) a single-stranded end compatible with ends produced 
by cleavage with said punctuating restriction endonuclease; 

ii) a recognition sequence for a type IIs restriction 
endonuclease located 5' to said single-stranded end; 

iii) a recognition sequence for a 5'-cloning restriction 
endonuclease located 5' to a recognition sequence for said type IIs 
restriction endonuclease, providing a first cDNA construct, 

(d) isolating said first cDNA construct by affinity capture on solid 
phase streptavidin, 

(e) cleaving said first cDNA construct with said type IIs restriction 
endonuclease providing a second cDNA construct, 

(f) ligating to said second cDNA construct a 3' adapter 
comprising, 

i) a degenerate single-stranded end compatible with 
ends produced by said type IIs restriction endonuclease; 

ii) a recognition sequence for said punctuating restriction 
endonuclease located 3' to said degenerate end; 

iii) a recognition sequence for a 3'-cloning restriction 
endonuclease located 3' to a recognition sequence for said 
punctuating restriction endonuclease, providing a third cDNA 
construct, 

(g) digesting said third cDNA construct with said 5' and 3'-cloning 
restriction endonucleases providing a fourth cDNA construct, 

(h) inserting said fourth cDNA construct into a cloning vector 
digested with said 5' and 3'-cloning restriction endonucleases, 

(i) replicating said vector DNA in a suitable host strain, 

(j) isolating said vector DNA, 

(k) digesting said vector DNA with said punctuating restriction 
endonuclease providing a tag comprising cDNA sequences and a GC 
clamp, 

(l) ligating said tags providing arrays of tags comprising at least 
10 tags and GC rich clamps. 

6. The method of claim 5, wherein the punctuating 
restriction endonuclease is MspI, and the type IIs restriction 
endonuclease is BsgI. 

7. The method of claim 5, wherein the 5'-cloning 
restriction endonuclease is EcoRI and the 3'-cloning restriction endonuclease is 
NotI. 

8. A method of identifying patterns of gene transcription, comprising 
steps of providing one or more tags, according to claim 1, from sources of interest, 
and identifying patterns of gene transcription. 

9. A method of detecting a difference in gene transcription between two 
or more mRNA populations, by identifying patterns of gene transcription according 
to claim 8, in more than one sample, and comparing patterns of gene transcription 
from a first mRNA population, to the patterns of gene transcription from another 
sample. 

10. A method of determining the relative frequency of gene transcription 
in an mRNA population comprising the steps of: 

(a) providing an array of tags according to claim 5, 

(b) sequencing the array of tags, and 

(c) determining relative frequency of tags. 

11. A method of screening for a stress on a cell comprising the steps of: 

(a) identifying patterns of gene transcription in the absence of 
stress, or physiological stresses of interest, 

(b) identifying patterns of gene transcription in the presence of a 
stress of interest, and 

(c) comparing the patterns of gene transcription from (a) and (b) 
according to claim 8. 

12. A method of detecting the presence of a stress in a target organism 
comprising the steps of: 

(a) providing the tag sequence of a gene that is differentially 
expressed in a normal cell or tissue, 

(b) hybridizing a cDNA library obtained from a first cell or tissue 
with said tag, 

(c) hybridizing a cDNA library obtained from a second cell or 
tissue with said tag sequence, and 

(d) comparing the level of transcription of said gene in said first 
and second cell or tissue. 

13. A method of isolating a gene, comprising probing a cDNA library with 
a probe comprising a tag according to claim 1 and isolating a gene. 

14. A kit for obtaining a tag or an array of tags comprising: 

(a) a 5' adapter, 

(b) a 3' adapter, 

(c) a vector, and 

(d) one or more restriction endonucleases. 

SEQUENCE LISTING 

<210> 1 
<211> 66 
<212> DNA 
<213> Artificial Sequence 

<220> 
<223> Description of Artificial Sequence: synthetic 

<400> 1 
gccgaattcg aaacggccga tgtcttcagt cgacctgtat ggcccttagc attagggctg    60 
tgcagc                                                               66 

<210> 2 
<211> 68 
<212> DNA 
<213> Artificial Sequence 

<220> 
<223> Description of Artificial Sequence: synthetic 

<400> 2 
cggctgcaca gccctaatgc taagggccat acaggtcgac tgaagacatc ggccgtttcg    60 
aattcggc                                                             68 

<210> 3 
<211> 62 
<212> DNA 
<213> Artificial Sequence 

<220> 
<223> Description of Artificial Sequence: synthetic 

<400> 3 
ccggagatct gcggccgctc gactgcaaga cgacagtcta acgcaaagga aaaggctaac    60 
tg                                                                   62 

<210> 4 
<211> 64 
<212> DNA 
<213> Artificial Sequence 

<220> 
<223> Description of Artificial Sequence: synthetic 

<400> 4 
cagttagcct tttcctttgc gttagactgt cgtcttgcag tcgagcggcc gcagatctcc    60 
ggnn                                                                 64 

<210> 5 
<211> 39 
<212> DNA 
<213> Artificial Sequence 

<220> 
<223> Description of Artificial Sequence: synthetic 

<400> 5 
aattcagata aggcgcgccg atgtcttcat ttgtgcagc                           39 

<210> 6 
<211> 37 
<212> DNA 
<213> Artificial Sequence 

<220> 
<223> Description of Artificial Sequence: synthetic 

<400> 6 
cggctgcaca aatgaagaca tcggcgcgcc ttatctg                             37 

<210> 7 
<211> 16 
<212> DNA 
<213> Artificial Sequence 

<220> 
<223> Description of Artificial Sequence: synthetic 

<400> 7 
ccggagttta aacagc                                                    16 

<210> 8 
<211> 22 
<212> DNA 
<213> Artificial Sequence 

<220> 
<223> Description of Artificial Sequence: synthetic 

<400> 8 
ggccgctgtt taaactccgg nn                                             22 
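
As a consistency check on the listing above, the following sketch compares each declared <211> length with the number of residues actually listed and tests whether SEQ ID NO: 2 is the reverse complement of SEQ ID NO: 1 preceded by two extra 5' bases. The helper names are hypothetical, the residues are transcribed from this listing, and the strand relationship is an observation made by the script on those residues rather than a statement taken from the listing itself.

```python
# Illustrative sketch only: helper names are hypothetical. Residues and
# declared lengths are transcribed from the sequence listing above.

COMPLEMENT = str.maketrans("acgtn", "tgcan")

# (SEQ ID NO, declared <211> length, residues as listed in blocks of ten)
LISTING = [
    (1, 66, "gccgaattcg aaacggccga tgtcttcagt cgacctgtat ggcccttagc attagggctg tgcagc"),
    (2, 68, "cggctgcaca gccctaatgc taagggccat acaggtcgac tgaagacatc ggccgtttcg aattcggc"),
    (3, 62, "ccggagatct gcggccgctc gactgcaaga cgacagtcta acgcaaagga aaaggctaac tg"),
    (4, 64, "cagttagcct tttcctttgc gttagactgt cgtcttgcag tcgagcggcc gcagatctcc ggnn"),
    (5, 39, "aattcagata aggcgcgccg atgtcttcat ttgtgcagc"),
    (6, 37, "cggctgcaca aatgaagaca tcggcgcgcc ttatctg"),
    (7, 16, "ccggagttta aacagc"),
    (8, 22, "ggccgctgtt taaactccgg nn"),
]


def residues(text: str) -> str:
    """Strip the block spacing and return the bare lowercase sequence."""
    return "".join(text.split()).lower()


def reverse_complement(seq: str) -> str:
    """Reverse complement; the ambiguity code n is kept as n."""
    return seq.translate(COMPLEMENT)[::-1]


def length_mismatches(listing):
    """Return (SEQ ID NO, declared, counted) for entries whose lengths differ."""
    return [
        (seq_id, declared, len(residues(text)))
        for seq_id, declared, text in listing
        if len(residues(text)) != declared
    ]


SEQS = {seq_id: residues(text) for seq_id, _, text in LISTING}

print("length mismatches:", length_mismatches(LISTING))
print(
    "SEQ ID NO: 2 == 'cg' + reverse complement of SEQ ID NO: 1:",
    SEQS[2] == "cg" + reverse_complement(SEQS[1]),
)
```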